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We introduce the notion of de Bruijn entropy of an Eulerian quiver and show how the correspond¬ 
ing relative entropy can be applied to practical string similarity problems. This approach explicitly 
links the combinatorial and information-theoretical properties of words and its performance is supe¬ 
rior to edit distances in many respects and competitive in most others. The computational complex¬ 
ity of our current implementation is parametrically tunable between linear and cubic, and we outline 
how an optimized linear algebra subroutine can reduce the cubic complexity to approximately linear. 
Numerous examples are provided, including a realistic application to molecular phylogenetics. 


I. INTRODUCTION 

String similarity is a fundamental problem touching on computer science, bioinformatics, machine learning, and 
many other areas |25j . Most fast approaches to string similarity (e.g., bag-of-words or kernel methods) are heuristic, 
whereas most theoretically grounded approaches to string similarity (e.g., Kolmogorov complexity methods) are slow. 
In this paper, we discuss a technique that bridges the gap, offering performance that can be tuned between linear 
and (in a sufficiently optimized implementation) subquadratic time while offering a clear interpretation in terms of 
combinatorial and information-theoretical primitives. Our technique is particularly well suited for comparing words 
based on their local structure and is agnostic to global structure, which is particularly interesting for comparing 
strings/words encoding paths through digraphs with cycles (e.g., control flow graphs of computer programs) or for 
streaming data. 

The paper is structured as follows. We begin by establishing notation and graph constructions in pi] be fore dis¬ 
cussing basic combinatorial properties in §III| and our ultimate information-theoretical considerations in §IV[ Finally, 
we outline applications, focusing in particular on molecular phylogeny as an example where an approximate “ground 
truth” furnishes a basis for evaluating the performance of our approach and its comparison with conventional tech¬ 
niques. Appendices on spin models as well as code for reproducing and extending our results are included. 


II. PRELIMINARIES 

We begin with some preliminaries to establish basic definitions and notation. Let n < oo and consider a finite set 
A := {oi,..., a„} which we call an alphabet. A word or string over A of length i is an element of A^\ a symbol is a word 
of length 1. The word w = (u>i ,... ,W() will typically be written as w = Wi.. .W(. With a slight abuse of notation, 
we write £{w) = i. The concatenation of two words w = wi .. .wi and w' = w[ ... w'^, is ww' := ... w^w'^... w'^,. 

A cyclic word or necklace [3T] of length i is the set of cyclic shifts of a word. We shall engage in a minor abuse of 
notation by letting w denote either a word or a cyclic word depending on context. If w is cyclic, Wj := ^od t)+t- 

Recall that a quiver (also known as a multidigraph, directed multigraph, etc.) Q is an ordered pair (V{Q),E{Q)) = 
{V, E) s.t. A is a multiset over U x U [5]. The adjacency matrix A{Q) of Q is defined so that if there are a edges from 
Vj to Vk, then A((5)jfe := o- R is clear that a quiver may be reconstructed from its adjacency matrix and vice versa, 
so that we may write f{Q) = /(A) for a generic function / without any ambiguity so long as either side is defined. 
Furthermore, we may make the implicit identifications Vj = j and Q = A(Q) for convenience. 

For w cyclic and k < £{w), the order k de Bruijn quiver [33] Qk{w) is given by 

• V{Qk{w)) := A'^; 

• E{Qk{w)) := {{wi+j . ..Wk+j,W 2 +j . ..Wk+i+j) : 0 < j < £}. 
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That is, the edges of Qk{w) correspond to the subwords of length (fc + 1) (a/k/a {k + l)-grams) of w, with multiplicities 
counted. Figure shows an example. 



FIG. 1: The cyclic words ATAGTC (L) and AGTATC (R) have identical order 1 de Bruijn quivers and (equivalently) the 
same 2-grams over the alphabet {A,C,G,T}. The quiver edges are annotated with and colored according to j. 

Removing the dashed edges yields quivers for non-cyclic words in the obvious way. 

Remarks. 

• The order 0 de Bruijn quiver of a word w has one vertex corresponding to the empty word and edges corre¬ 
sponding to each symbol in w, with multiplicity. 

• If ui is a n-ary de Bruijn sequence of length n^, then Qk-i{w) is the n-ary de Bruijn graph with n^ edges. 

• Qk{w) is Eulerian (i.e., [strongly] connected and with indegrees equal to outdegrees, so that we may unam¬ 
biguously write deg(t;) for either quantity at vertex v) iff w contains every possible fc-gram (otherwise, there 
are isolated vertices, but we may elide this technicality without comment at times). An Euler circuit on Qk{w) 
corresponds to a Hamiltonian path on Qk+i{w). These properties are why we deal with cyclic words. 

III. COMBINATORICS 

To begin this section, we remark that it is a concrete application (though perhaps of sufficient generality as to 
border on a reformulation) of the so-called transfer matrix method |3Ij . 

Write w w' iff Qk{w) = Qkiw'). It is clear that is an equivalence relation; denote the corresponding 
equivalence class of w by [uijfc. Let Wk{w) := |[w]fe| denote the number of cyclic words with the same order k de 
Bruijn quiver as w. In order to compute Wk{w) it is convenient to consider the adjacency matrix Ak(w) = A(Qk(w)). 

If A is a square matrix, write d{A) for the vector with components given by the diagonal entries of A. If now I 
denotes a vector of ones, then L{A) := d{Al) — A is the Laplacian of A. We recall the following two classical theorems: 

Matrix-tree theorem. Let Q be a quiver. The diagonal cofactors of L{A{Q)) are all equal to each other and to 
the number t(Q) of directed spanning trees of Q oriented towards any fixed vertex. □ 

BEST [de Bruijn, van Aardenne-Ehrenfest, Smith, and Tutte] theorem. Let Q be an Eulerian quiver. 
Then the number c{Q) of Euler circuits of Q is 

c{Q)=t{Q)- n (deg(u)-l)!. □ (1) 

v&V{Q) 


These readily yield the following 


Corollary. [TTIIT^ Let A := Ak{w) correspond to an Eulerian de Bruijn quiver. Then 


Wk{w) = WiA) 


E 

d\ gcd(A) 


(j){d) ■ c{A/d) 
d-{A/d)\ 


( 2 ) 


Here is the totient function, the gcd is defined elementwise and M\ := Jli T^e d = \ term dominates, 

giving the simple and effective approximation W{A) « c{A)/A\. We note that if Q is an Eulerian (not necessarily de 
Bruijn) quiver with adjacency matrix A, then we may still write W (A) or W{Q) for the RHS of However, it is 
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not necessary to directly interpret W in this more abstract context, since any finite Eulerian quiver can be embedded 
in some de Bruijn quiver. 

Sketch of proof. The formula is obvious when gcd(A) = 1, as in this case every cyclic word in [rcjfc corresponds 
to Al Euler circuits. More generally, for (i|TO|gcd(A), the term counts the cyclic words in of period 

ijm with multiplicity Since A J2d\m result follows. □ 


A. Example 1 


Consider w = ABRACADABRA over A = {A, B, C, D, R}. Qi{w) and Q 2 {w) are depicted in figure 



FIG. 2: (L) Qi(w). (R) QaH- 


We have that 


Ai(w) 


Aaa ‘2ab Iac 1a£) Oar\ 

Oba Obb Obc Obd ‘^br 

IcA OcB Occ OcD OcR 

Ida Qdb Odc Odd Odr 

\2ra Orb Orc Ord OrrJ 


(3) 


where for convenience we have annotated matrix entries with subscripts corresponding to the 2-grams. It is easy 
to check that t{Qi{w)) = 4. Meanwhile, the ordered tuple of vertex degrees (i.e., the row or column sums of A) is 
(5, 2,1,1, 2), so n„(deg(n) — 1)! = 24. Therefore c((5i(ri;)) = 4 • 24 = 96. The sum in equation ( p|) ha s only a single 
term, corresponding to d = 1, and so Wi{w) = (0(1) • 96)/(l • (2!)^ • (1!)® • (0!)^^) = 96/8 = 12. |40j By noting the 
cycle structure of Qi{w)^ we can list the 12 elements of [wji by hand. In the lexicographical order inherited from the 
usual order on A, these are: 


1 

A 

B 

R 

A 

B 

R 

A 

C 

A 

D 

A 

2 

A 

B 

R 

A 

B 

R 

A 

D 

A 

C 

A 

3 

A 

B 

R 

A 

C 

A 

B 

R 

A 

D 

A 

4 

A 

B 

R 

A 

C 

A 

D 

A 

B 

R 

A 

5 

A 

B 

R 

A 

D 

A 

B 

R 

A 

C 

A 

6 

A 

B 

R 

A 

D 

A 

C 

A 

B 

R 

A 

7 

A 

C 

A 

B 

R 

A 

B 

R 

A 

D 

A 

8 

A 

C 

A 

B 

R 

A 

D 

A 

B 

R 

A 

9 

A 

C 

A 

D 

A 

B 

R 

A 

B 

R 

A 

10 

A 

D 

A 

B 

R 

A 

B 

R 

A 

C 

A 

11 

A 

D 

A 

B 

R 

A 

C 

A 

B 

R 

A 

12 

A 

D 

A 

C 

A 

B 

R 

A 

B 

R 

A 


It is not difficult to similarly show that 140 (w) = 1. 

Einally, consider w' = BARBARA. This is a degenerate case as the symbols C and D are not present, so that the 
de Bruijn quiver is only componentwise Eulerian: i.e., the in- and outdegrees coincide, but some are zero, so that the 
quiver is not connected. Taking A' = {A, B, i?} (or equivalently and perhaps more straightforwardly, modifying the 
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Laplacian by changing any zeros along the diagonal to ones) to remedy this, we get that t{Qi{w'j) = 4. The ordered 
tuple of vertex degrees is (3, 2, 2), so — 1)! = 2 and c{Qi{w')) = 2-4 = 8. As before, there is a single term 

in the sum for Wi{w') = (^i(l) • 8)/(l • (2!)^ • (1!)^ • (0!)^) = 2. The other element of is BARARBA. □ 


B. Example 2 


Consider A = {0,1}, k = 1, and fix i. If g G {00,01,10,11}, let Xg{w) be the number of times that g occurs 
in w. Because w is cyclic, we must have a:oi = Xio = {i — Xgo — x\\)/2 =: a;*, and Ai{w) = ( x)* )■ [H] We 

have that L{Ai{w)) = x* so t{Qi{w)) = x*. Furthermore, deg(O) = Xqo +3:* and deg(l) = a;* + xn, so 

c{Qi{w)) = a;* • (a;oo + a;* — 1)! • (cc* + Xn — 1)!. It follows after a line or two of algebra that 


Wi{w) = lFi(a;oo,a;*;^) = 


(ccoo + xA){xi, + Xu) 


E 


(t>{d) 


d\ gcd(xoo,a:il,a:^») 

Explicitly, for £ = 16, we have the following table of values, with zeros omitted: [42 

2^00 


/{xoo + x*)/d\ /(a:* + a;ii)/d\ 
\ a;*/d J \ a;*/d J 


IVi 

0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

U 

0 

1 













1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

2 

7 

12 

17 

20 

23 

24 

25 

24 

23 

20 

17 

12 

7 

3 

22 

55 

90 

120 

140 

147 

140 

120 

90 

55 

22 



4 

43 

120 

212 

280 

309 

280 

212 

120 

43 





5 

42 

126 

210 

245 

210 

126 

42 







6 

22 

56 

75 

56 

22 









7 

4 

7 

4 












(4) 


Summing over the table entries shows that there are 4116 distinct cyclic binary words of length 16, a fact which can 
be confirmed via the Cauchy-Frobenius lemma. 

Figure 1^ shows results in the same vein for i = 256. It is evident that IFi behaves very much like a Gaussian, with 
the only significant qualitative difference resulting from the triangular domain. Similar results hold for more general 
contexts, and this fact might enable analytical estimates for W{A). □ 


IV. INFORMATION THEORY 
A. de Bruijn entropy 

Definition. The order k de Bruijn entropy of a cyclic word w is H}^{w) := log Wfe(w). 

As in §III[ we may also consider the entropy of an Eulerian quiver Q or of its adjacency matrix A, written respectively 
H{Q) and H{A). Typically the logarithm will be taken with base \A\ unless otherwise indicated. 

This definition evokes Boltzmann’s physical interpretation of entropy as the logarithm of the number of microscopic 
configurations of a system that are consistent with the system’s macroscopic characterization. Here, the “macroscopic 
characterization” of w is just Qk{w), and “microscopic configurations” are just members of [w]fc. Another perspective 
realizes this definition as an analogue for finite words of the capacity of the discrete noiseless channel d la Shannon 
P5] , or equivalently of the topological entropy of a subshift of finite type El- 


September 11, 2015 


Approved for Public Release; Distribution Unlimited 






Non-Technical Data - Releasable to Foreign Persons 


5 


H, = log2W^ 


black; = log^ W^; red; contours of naive quadratic fit 



FIG. 3: (L) Contour plot of Hi := logj Wi for £ = 256. Inset: plot of Hi values intermittently sampled along the dotted line 

and comparison to a naive quadratic fit. Samples along horizontal lines behave similarly. (R) Black countours: plot of Hi after 
the horizontal transformation xoo i—(1 + ■ xoo- Red contours: naive quadratic fit. Inset: plot of transformed Hi values 

intermittently sampled along the line x* = £/^ and comparison to a naive quadratic fit. Note that the naive quadratic fit is a 
slight overestimate at the peak. 


1. Compression arguments 


Recall now the context of I IIIB In order to completely specify a cyclic binary word w, it suffices to specify both 

• Ai{w), which requires a total of 2 |’log 2 — 1 bits (because it requires [log 2 C\ bits to specify xqo and |"log 2 — 1 

bits to specify x*); 

• The appropriate element of [w]i, which requires at most ^ — log 2 f' bits (because there are roughly 

cyclic binary words of length £). 


In particular, if Hi{w) < i — 2\og2i + 1, then we have the outline of a scheme for losslessly compressing w (the 
generalizations to n > 2 and A: > I are not fundamentally different). Note that while most words are too statistically 
uniform (or more precisely, the adjacency matrices of their de Bruijn quivers have elements that are too similar) to 
be compressed in this way, in practice one is rarely interested in compressing statistically uniform data. Indeed, we 
recall that a standard diagonal argument shows that any fixed compression scheme will fail to compress most data 


The perspective of algorithmic information theory hinted at here will directly motivate the definition of relative de 
Bruijn entropy in §IVB| 


2. Maximally informative values of k 

Although the paper [29] leverages the empirical probability distribution of fc-tuples rather than the more detailed 
notion of de Bruijn quivers, it nevertheless gives strong experimental evidence that the natural heuristic k = [log„ l\ 
is a good approximation for the lower limit of a reasonably narrow range of maximally informative values of k in 
practice. While this paper also discusses upper limits on this range, these depend on the particular word and are of 
less practical interest for the obvious reason that increasing k requires more storage. 


3. Remarks on computational complexity 

A detailed analysis of the complexity of computing the de Bruijn entropy is likely to be both more intricate and 
less informative than an experimental one, owing to the complex relationship between the local statistical behavior 
of words and their corresponding quiver connectivity structure as a function of k (this is particularly true for the 
relative de Bruijn entropy, for which see below). However, we note that the dominant contribution to runtime is a 
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matrix determinant (note that forming the adjacency matrices of quivers of words can be done in linear time), and 
we briefly discuss its complexity here. 

Let ui denote the exponent for the complexity of matrix multiplication/inversion/determinant evaluation (say, 3 or 
perhaps 2.808 in practice, or 2.373 in theory [31]). Now 0{{n^Y) = 0(£) k = uj~^ log„ £+0(1) determines k s.t. 
computing the de Bruijn entropy requires linear time with standard techniques of linear algebra (e.g., computing the 


determinant via LU or QR decomposition as in our current implementation). Meanwhile, as pointed out in (IV A 2 
a reasonable rule of thumb for the maximally informative value of k is [log„ £\. 

These two observations can be combined by thinking of /c as a scale below which where we have complete information 
about the structure of words and of as a scale above which negligible information suitable for comparisons between 
words is discernable in linear time using standard techniques of linear algebra. That is, the computation of a de Bruijn 
entropy can be forced to run in linear time by choosing k = log„^J (or, for that matter, k = 0 ( 1 )), with the 
consequence that this amounts to neglecting not only correlations at scales greater than k (as usual), but also the 
ability to capture statistical fluctuations of any sort at scales beyond Insisting on fc = [log„ £\ means in practical 
terms that our technique requires cubic time in the implementation used here. 

However, it is possible to do better, though for the sake of keeping this paper reasonably circumscribed we will 
confine ourselves here to a brief discussion. The reader will probably have noticed the phrase “standard techniques 
of linear algebra” repeated above, and considered the associated references to matrix decompositions for computing a 
determinant in (what is in practice) cubic time. In fact the determinant can be evaluated in less than 0((n^)^) time: 
it can be done in 0((n^)^) or even 0((n^) log^(n^)) time using so-called black box linear algebra |331I3S|. The key 
here is that a diagonal minor L of the Laplacian has a predetermined sparse structure, so that the oracle x i—>■ Lx can 
be realized in subquadratic time. This faciliates the computation of the characteristic polynomial of L using so-called 
superfast Toeplitz solvers in 0((n^) log^(n*)) time [Hll], from which the determinant follows trivially. |43| 

Thus although our current implementation essentially has cubic time complexity for maximally informative values 
of k, it can already be regarded as having linear complexity for k independent of £^ and a sufficiently optimized 
linear algebra subroutine would yield complexity that is just the product of a linear and a polylogarithmic term in 
the general regime of interest, rendering it competitive with bag-of-words or kernel methods [25j that have linear 
complexity but weaker or more ad hoc theoretical justification. 


B. Entropy of componentwise Enlerian quivers and relative de Bruijn entropy 


Write 


ABA' := (A- A') \/0 + (A' - A)^ V 0, (5) 

where the maxima are taken elementwise. It is easy to see that if A and A! both correspond to componentwise 
Eulerian quivers, then so does A B AL. Indeed, this is the adjacency matrix of the quiver that naturally corresponds 
to A — A' after reversing edges with negative matrix entries. [33] 

With this in mind, let A^l^ be adjacency matrices respectively corresponding to Eulerian quivers so that 

Q ■■= Uj is a componentwise Eulerian quiver with corresponding adjacency matrix A. Define 

W(A) :=YIw{A’‘L) ( 6 ) 

j 


and 


H{A) := log W(A) = H(A^L)_ ( 7 ) 

3 

To avoid degeneracies, we define W(a) = 1 and H(a) = 0, where here a is the 1x1 adjacency matrix corresponding 

to the quiver Tq with a single vertex and a > 0 edges (i.e., loops). This definition extends the prior one from 

Eulerian quivers to componentwise Eulerian quivers. Note that H{A) = H(A^) and (A B A')'^ = A' B A, so that 
H(ABA') = H(A'BA). 

Suppose now that we have two cyclic words w and w' over A. Given w and therefore also Ak(w), all that is needed 
to determine Ak(w') is the difference Ak(w') — Ak(w), or equivalently the two nonnegative matrices 

Akiw\w') := [Afc(w) - Afc(w')] V 0; ( 8 ) 

Akiw'\w) := [Afc(w') - Afc(w)] V 0. (9) 
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In order to completely specify w' given w, it therefore suffices to specify Ak(w\w'), Ak(w'\w), and a number of roughly 
Hk{w') bits. It is clear that Hk+i{w') < Hk{w'). If w' is far from statistically uniform, then there will be some critical 
value k « £{w') s.t. Hk{w') = 0 (note that 77^(u,')_i(rc0 — 0)- this point all the information in w' that is not 
latent in w is encoded in the matrices Ak{'w\w') and Ak{'w'\w). In other words, the conditional Kolmogorov complexity 
K{w'\w) as well as the information distance [11[T7] can be approximated by a function of these matrices. [IS] 

This motivates the following 

Definition. The order k relative de Bruijn entropy of w' given w is Hk{w'\\w) := H{Ak{w,w')), where 
Ak{w,w') := Ak{w) B Ak{w') = Ak{w\w') + Aj{w'\w). More generally, the relative entropy of A' given A is de¬ 
fined as H{A'\\A) ■= H{A' B A). 

Note that (unlike the Kullback-Leiber incarnation of relative entropy for probability distributions) the relative 
entropy of componentwise Eulerian quivers is symmetric. Our experiments have shown that it is however not a 
pseudometric: i.e., it does not satisfy the triangle inequality. Nevertheless, it is straightforward to use the relative 
entropy to derive a pseudometric on a fixed set of words using the results of |5| . 


1. Example 3 

Setting u := ABRACADABRA and v := ABARACARBAD, we have that Ai{u) B Ai{v) = Ai{w), where 
w := ABRABRABRA and Hi{w) = 0, so Hi(u\\v) = 0. |46| Meanwhile, the (Levenshtein) edit distance between 
u and V turns out to equal 5. It is evident at least in this particular case that the relative entropy better captures 
similarity in the local structure of words than an edit distance does. 


2. Example 4 

For 1 < m S N, let w := and w' := (0^1^)™. For k < £, a. straightforward (if somewhat tedious) calculation 

shows that Hk{w\\w') < mlogmk, whereas the (Levenshtein) edit distance between w and w' is 2[m/2j£. That is, 
for k and m fixed we have that Hk{w\\w') = 0(1), whereas the corresponding edit distance is 0(£). 


3. Example 5 

Figure 1^ depicts the relative entropy Hi{'w'\\w) of cyclic binary words w,w' as a function of Sqq and a;(., where 
£ = 256, a:oo = 32, and a:* = 80. 

The relative entropy is zero along the strip |a:(. — a;*| < I (and nowhere else). This is because in the strip, Ai{w) B 
Ai{w') corresponds to coherently inserting and/or deleting only cyclic subwords of the form = 0 ... 0, W( 2 ) = 01, 
and W( 3 ) = I... I. While there are generally many ways to do this, the cyclic subword W(i)W( 2 )W(^) = 0... 01... 1 
satisfies Hi{w(i)W( 2 )W{p,)) = 0. This in turn is a manifestation of the simple cycle structure of Ai{w) B Ai{w') in the 
strip. 

The cycle structure of j4i(ui) B Ai{w') is also behind the more significant phenomenon of relative entropy values 
exceeding £(w) V £{w'). This highlights the constraint that any modified relative de Bruijn entropies of the form 
H{f {Ak{w), Ak{w'))) should be such that f{Ak{w), Ak{w')) is still a componentwise Eulerian adjacency matrix. 


V. APPLICATIONS 
A. Biological systematics 

Molecular phylogenetics-i.e., the analysis of evolutionary relationships based on hereditary molecular 
characteristics-and biological classification of organisms typically focus on comparing DNA sequences [9|. A particu¬ 
larly convenient form of DNA for this purposes is mitochondrial DNA (mtDNA). mtDNA is an extremely economical 
repository of information in that nearly every base pair in human mtDNA is known to code for some protein or RNA 
product, and there is even overlap between coding regions; meanwhile mammalian mtDNA sequences are only on the 
order of 20k base pairs. Moreover, mtDNA is not highly conserved and mutates rapidly. 

In figures and below we show that relative de Bruijn entropy produces results comparable if not superior to an 
edit distance (cf. figure]^ for constructing phylogenetic trees that easily capture most of the evolutionary relationships 
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H^(w'||w) with Xqq = 32 and x. = 80 



FIG. 4: Relative entropy Hi{w'\\w) of cyclic binary words w,w' as a function of Xqq and xi, where £ = 256, *00 = 32, and 
Xt = 80. Regions where Hi > I are shaded (from left to right, they correspond to words close to 1... 1, 0 ... 01... 1, and 
0... 0). The red contour delineates where the sum of the entries oi AA' equals £. This makes it clear that values of the 
relative entropy exceeding £ are due to a very large disparity between the input matrices. An alternative approach guaranteeing, 
e.g., H{w'\\w) < £(w) V £{w') would appear to either require working directly with the cycle spaces of A and A! or the blunt 
invocation of a threshold. 


among primates (cf. figures and from mtDNA sequences alone. Furthermore, the comparative performance of 
the relative entropy is likely to improve in other problem domains that can actually leverage the fact that the relative 
entropy captures local correlations while ignoring global correlations (cf. j |V C ). 

It is worth noting that the technique outlined here is alignment-free [5] , and a suitable implementation optimized for 
speed (which our current implementation certainly is not, cf. 1IV A 3) is a promising candidate tool for bioinformatics. 
In particular, it is an attractive alternative to current techniques such as those in [H Ha IMl ES El and the older 
but perhaps conceptually closer approach of [I6j . We note also that de Bruijn quivers have been considered in the 
context of multiple alignment [MI^EE]- 


B. Textual comparison 

There are numerous downstream applications of textual comparison, e.g. authorship attribution (see, e.g., |27jl. It 
may be fruitful to engage in a dramatic simplification of textual comparison at large scales with relative de Bruijn 
entropy by, e.g., working with the results of part-of-speech tagging (perhaps focusing in particular on function words 
versus content words). Another possibility is to omit or “don’t care” words that do not frequently occur in a corpus 
and treat the words themselves as symbols in a very large alphabet while keeping the order of the induced quivers 
very small. 


C. Behavioral analysis of computer programs 

Notwithstanding the interesting applications discussed above, the raison d’etre of the present paper is the anticipated 
application of its theory to the low-level behavioral analysis of computer programs of unknown provenance. Although 
the general problem of disassembly of binary executable code is undecidable, interactive disassembly and automated 
reverse engineering techniques can facilitate static code analysis. In particular, a disassembled binary executable 
program may be conveniently represented in a platform-independent intermediate language such as REIL |10j or an 
enhanced variant called Power-REIL (PREIL). In turn, individual machine-level instructions (or short sequences of 
them) can be mapped to a reduced set of behaviors, which serve as the symbols in a prospective alphabet. One or 
more of these behaviors or symbols may serve as a flavor of “don’t care”. 

An abstract view of the preceding enterprise consists of annotating (a suitable coarse-graining of) the program’s 
control flow graph with the appropriate behaviors: vertices are annotated with a control flow behavior, and edges 
are annotated with a sequence of other behaviors. Specihc executions of the program correspond to walks in the 
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Tribe 



Family Parvorder Suborder 

Infraorder 




3-5 

CO ® 


Tarsiiformes 


Chiromyiformes 


FIG. 5: “The molecular phylogeny of 186 primates and four species representing the two outgroup orders of Scandentia, 
Dermoptera, and rooted by Lagomorpha”. Figure and quotation taken from |23| and used under the CC-BY-2.5 license (see 
http://creativecommons.org/licenses/by/2.5/). 
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Nycticebus bengalensis 
Nycticebus coucang 
Nycticebus bengalensis 
Loris tardigradus 
Loris lydekkerianus 
Perodicticus potto 
Otolemur crassicaudatus 
Galago senegalensis 
Galago moholi 
Daubentonia madagascariensis 
Tarsius syrichta 
Tarsius bancanus 
Varecia variegata 
Prolemur simus 
Lemur catta 
Hapalemurgriseus 
Eulemur macaco 
Eulemurmongoz 
Eulemur rufus 
Eulemur fulvus 
Eulemur fulvus 
Chelrogaleus medlus 
Lepilemur ruficaudatus 
Lepiiemur hubbardorum 
Propithecus coquereli 
Avahi laniger 
Trachypithecus vetulus 
Trachypithecusjohnii 
Semnopithecus entellus 
Trachypithecus shortridgei 
Trachypithecus hatinhensis 
Trachypithecus obscurus 
Trachypithecus germaini 
Presbytis melalophos 
Pygathrix nigripes 
Pygathrix nemaeus 
Pygathrix cinerea 
Pygathrix cinerea 
Rhinopithecus roxellana 
Rhinopithecus brelichi 
Rhinopithecus sfrykeri 
Rhinopithecus bieti 
Rhinopithecus bieti 
Rhinopithecus bieti 
Rhinopithecus avunculus 
Simias concolor 
Nasalis larvatus 
Procolobus verus 
Piliocolobus badius 
Colobus guere; 
Nomascus siki 
Nomascus leucogenys 
Nomascus gabriellae 
Symphalangus syndactylus 
Hylobates pileatus 
Hylobates lar 
Hylobates agilis 
Pongo pygmaeus 
Pongo abelii 
Pan troglodytes 
Pan paniscus 
Homo sapiens 
Homo sapiens 
Homo sp. 
Homo heidelbergensis 
Gorilla gorilla 
Gorilla gorilla 
Erythrocebus patas 
Chlorocebus sabaeus 
Chlorocebus pygerythrus 
Chlorocebus tantalus 
Chlorocebus aethiops 
Cercopithecus albogularis 
Theropithecus gelada 
Papio ursinus 
Papio kindae 
Papio papio 
Papio cynocephalus 
Papio hamadryas 
Papio anubis 
Lophocebus aterrimus 
Macaca sylvanus 
Macaca thibetana 
Macaca mulatta 
Macaca fascicularis 
Mandrillus sphinx 
Cercocebus chrysogaster 
Leontopithecus rosalia 
Callicebus donacophilus 
Callicebus cupreus 
Chiropotes albinasus 
Cacajao calvus 
Cebus xanthosternos 
Cebus apella 
Cebus albifrons 
Saguinus oedipus 
Callithrix pygmaea 
Callithrix geoffroyi 
Aotus nancymaae 
Aotus lemurinus 
Aotus azarai 
Aotus azarai 
Lagothrix lagotricha 
Ateles belzebuth 
Alouatta caraya 
Saimiri oerstedii 
Saimiri sciureus 
Saimiri boliviensis 
Saimiri boliviensis 



FIG. 6: Automatically generated (cf. j ]B 3[ ) phylogenetic tree using average linkage for fe = 7 relative de Bruijn entropy 
(unnormalized). Mismatches w/r/t the tree in figure [^appear for the suborders Haplorrhini and Strepsirrhini, for the families 
Cebidae and Lorisidae, and for the subfamily Callitrichinae. Note that Lorisiformes here should be labeled as Galagidae, 
but this merely reflects an ambiguity in the input data annotation. Not explicitly shown, but also matched, are the tribes 
Cercopithecini, Colobini, Papionini, and Presbytini, the subfamilies Homininae and Lorisnae, the superfamily Hominoidea, and 
the infraorders Lorisformes (cf. previous comment) and Simiformes. 


September 11, 2015 


Approved for Public Release; Distribution Unlimited 



















































































































































































































Non-Technical Data - Releasable to Foreign Persons 


11 


Daubentonia madagascariensis 
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Nycticebus coucang 
Nycticebus bengalensis 
Perodicticus potto 
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Loris lydekkerianus 
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Galago senegalensis 
Galago moholi 
Tarsius syrichta 
Tarsius bancanus 
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Eulemur fulvus 
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Avahi laniger 
Trachypithecus vetulus 
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Semnopithecus entellus 
Trachypithecus shortridgei 
Trachypithecus hatinhensis 
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Presbytis melalophos 
Pygathrix nigripes 
Pygathrix nemaeus 
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Rhinopithecus roxellana 
Rhinopithecus brelichi 
Rhinopithecus sfrykeri 
Rhinopithecus bieti 
Rhinopithecus bieti 
Rhinopithecus bieti 
Rhinopithecus avunculus 
Simias concolor 
Nasalis larvatus 
Procolobus verus 
Piliocolobus badius 
Colobus guereza 
Nomascus siki 
Nomascus leucogenys 
Nomascus gabriellae 
Symphalangus syndactylus 
Hylobates pileatus 
Hylobates lar 
Hylobates agilis 
Pongo pygmaeus 
Pongo abelii 
Pan troglodytes 
Pan paniscus 
Homo sapiens 
Homo sapiens 
Homo sp. 
Homo heidelbergensis 
Gorilla gorilla 
Gorilla gorilla 
Erythrocebus patas 
Chlorocebus sabaeus 
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Chlorocebus tantalus 
Chlorocebus aethiops 
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Theropithecus gelada 
Papio ursinus 
Papio kindae 
Papio papio 
Papio cynocephalus 
Papio hamadryas 
Papio anubis 
Lophocebus aterrimus 
Macaca sylvanus 
Macaca thibetana 
Macaca mulatta 
Macaca fascicularis 
Mandrillus sphinx 
Cercocebus chrysogaster 
Leontopithecus rosalia 
Saguinus oedipus 
Callicebus donacophilus 
Callicebus cupreus 
Chiropotes albinasus 
Cacajao calvus 
Cebus xanthosternos 
Cebus apella 
Cebus albifrons 
Callithrix pygmaea 
Callithrix geoffroyi 
Aotus nancymaae 
Aotus lemurinus 
Aotus azarai 
Aotus azarai 
Lagothrix lagotricha 
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Alouatta caraya 
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Saimiri sciureus 
Saimiri boliviensis 
Saimiri boliviensis 
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Platyrrhini 


FIG. 7: Phylogenetic tree using average linkage for k = 7 relative de Bruijn entropy (normalized). Note that this and the 
previous figure are quantitatively very similar. The mismatches are the same as in the previous figure, with the addition of 
Lorisinae-, apart from this, matched but unshown taxnomical groupings are also as in the previous figure. 
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Tarsius syrichta 
Tarsius bancanus 
Perodicticus potto 
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Nycticebus coucang 
Nycticebus bengalensis 
Loris tardigradus 
Loris lydekkerianus 
Otolemur crassicaudatus 
Galago senegalensis 
Galago moholi 
Daubentonia madagascariensis 
Cheirogaleus medius 
Lepilemur ruficaudatus 
Lepilemur hubbardorum 
Varecia variegata 
Prolemur simus 
Lemur catta 
Hapalemur griseus 
Eulemur macaco 
Eulemur mongoz 
Eulemur rufus 
Eulemur fulvus 
Eulemur fulvus 
Propithecus coquereli 
Avahi laniger 
Homo sp. 
Homo sapiens 
Homo sapiens 
Homo heidelbergensis 
Nomascus siki 
Nomascus leucogenys 
Nomascus gabriellae 
Symphalangus syndactylus 
Hylobates pileatus 
Hylobates lar 
Hylobates agilts 
Pongo pygmaeus 
Pongo abelii 
Pan troglodytes 
Pan paniscus 
Gorilla gorilla 
Gorilla gorilla 
Macaca thibetana 
Macaca mulatta 
Macaca fascicularis 
Pygathrix nemaeus 
Pygathrix nigripes 
Pygathrix cinerea 
Pygathrix cinerea 
Trachypithecus vetulus 
Trachypithecus johnii 
Semnopithecus entellus 
Trachypithecus shortridgei 
Trachypithecus hatinhensis 
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Presbytis melalophos 
Rhinopithecus roxellana 
Rhinopithecus brelichi 
Rhinopithecus strykeri 
Rhinopithecus bieti 
Rhinopithecus bieti 
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Rhinopithecus avunculus 
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Nasalis larvatus 
Procolobus verus 
Piliocolobus badius 
Colobus guereza 
Erythrocebus patas 
Chlorocebus sabaeus 
Chtorocebus pygerythrus 
Chlorocebus tantalus 
Chlorocebus aethiops 
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Saimiri boliviensis 



FIG. 8: Average linkage for (Levenshtein) edit distance (unnormalized). Note that while the edit distance results match 
Strepsirrhini and Lorisidae, they fail to match the more fine-grained taxa Macaca and Homindae, the latter of which is 
particularly important to members of, e.g., Homo sapiens and is detailed in figure]^ (NB. Cercopithecinae and Cercopithecidae 
are ambiguously labeled here and in previous figures due to the input data annotation and as such are not remarked on for 
comparative evaluation.) 
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Hylobates agilis 
x? Hylobates muelleri 
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FIG. 9: Detail of Hominoidea from figures and[^ respectively. The local restrictions of trees from figures and to 
common species are topologically equivalent (note that the trees have internal symmetries), whereas the local restriction of the 
tree from disagrees with them. 


control flow graph, which induce corresponding sequences of behaviors-i.e., words over the behavior alphabet. Broad 
coverage of the control flow graph can be accelerated through dynamic execution m- At this point one has a set 
of words corresponding to representative program behavior sequences. Comparing these (or extracted subwords) to 
words containing exemplars of bad behavior can then inform security assessments [5] . 

The rationale for using relative de Bruijn entropy here versus a more familar construct such as edit distance is 
straightforward. The relative de Bruijn entropy compares all of the local structure of words while remaining agnostic 
to the global structure. Because of this, it is particularly well suited to comparing words corresponding to different 
paths through a control flow graph, where “motifs” may appear with greater location variability than in biological 
sequences. In particular, it is natural to expect that loops containing complex sub-flow graphs are better suited to 
analysis with relative de Bruijn entropy than an edit or dynamic programming metric on words. 
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Appendix A: Spin models 

In this appendix, we briefly mention the simple connection (via the transfer matrix method) of de Bruijn quivers 
and entropies with the physics of finite one-dimensional spin models. 

A local potential of order A: is a function £ : —>■ K. The corresponding energy of a cyclic word w with £{w) > k is 

£(w) 

£{w) :='^£{wj ...Wj+k-i). (Al) 

t=i 

A probability distribution of the form P(w) = /Z{l3) over cyclic words of some fixed length I defines the 

canonical ensemble of a circular one-dimensional spin models or more generally a Gibbs field [5]. In statistical physics, 
/3 = l/kgT, where ks is the Boltzmann constant, T is the absolute temperature, and the normalizing factor Z(/3) is 
the so-called partition function. It turns out that every quantity of physical interest (e.g., entropy, internal energy, 
free energy, etc.) can be computed from a system’s partition function, and its determination is correspondingly the 
central goal of statistical physics ^D] . 

For example, taking A = {0,1}, aj := 2wj — I, and £{wjWj+i) := —2Jajaj+i — Kaj reproduces the ID (spin- 
1/2) Ising model describing model magnetic systems. Here the spins aj represent magnetic dipoles, J represents the 
strength of the dipole-dipole coupling, and K represents the external magnetic field. Another example is furnished 
by taking A = {A,C,G,T} and defining a local potential by the nearest-neighbor Gibbs free energy parameters 
quantitatively describing oligomeric DNA hybridization [26]. |47| 

The transfer matrix method facilitates the calculation of and its limit as £ —>■ oo. For example, it is a standard 
result that for the ID Ising model 

lim cosh/3AT -I- Je'^dJ sinh^ jdK -|- (A2) 

The particular flavor of transfer matrix method embodied in Q enables the exact calculation of for £ < oo by 
noting that 


Z = Y^ 

W 


Wie-f^^ = Y, 

Xqq,X^ 


(A3) 


along with 


£{w) = £{xoo,x*;^) = 2Kxoo + (4J -£ 2K)x^ - {J + K)t (A4) 

This is superfically rather different than writing an expression of the form Z = Tr(A^) as per a typical application of 
the transfer matrix method in physics, but in fact both approaches turn out to have the same essential content and 
structure Ell- 


Appendix B: Data and figure generation 

NB. MATLAB code detailed in )|^is called throughout this section. 


1 . mrm 


The figures in (IIIB were generated using the following MATLAB commands: 


Al = 4*[2,5;5,4] ; 

L = sum(suiii(Al)) ; % 256 
e = ncin(L/2,L) ; 
ce = ncLn(L/2,L); 
e2 = nan(L/2,L); 
ce2 = nan(L/2,L); 
for j = l:L/2 

for k = 1:(L-1) 

[j k] y, for tracking progress 

A2 = [k-1,j;j,L-(k“l)“2*j]; X same length 
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Aa = max(Al-A2,0); 

Ab = max(A2-Al,0); 

A = Aa^+Ab; 

if any(any(A2<0)) */# then A2 is ill-formed 
else 

e(j,k) = equiverentropy(A); 
ce(j,k) = cequiverentropy(A); 

end 

e2(j,k) = equiverentropy(A2); 
ce2(j,k) = cequiverentropy(A2); 

end 

end 

H = ce2/log(2); 

figure; [C ,h] = contour (H, 16:16: 256, 'kO ; 

clabeKC, ’meuiualO ; 

hold;plot((l:L/2),L/2-(1:L/2)+l,’k: 

set(gca,^XTick’,0:32:256) 

set(gca,^YTick’,0:32:128) 

xlabel(’x_0_0 ^); 

ylabeK ’x_*'); 

title(’H_l = log_2 W_10; 

Hmid = mean(H(:,[L/2,L/2+l]),2); 
ind = findC^isnanCHmid)); 
p = polyfit(ind,Hmid(ind),2); 

figure;plot(l:4: (L/2),Hmid(l:4:end) , ^ko\l: (L/2) ,p(l) + (1: (L/2) ) . “2+p(2) * (1: (L/2) )+p(3) , ’k: O ; 
aLxis ( [0 L/2 0 L] ) ; 
set(gca,^XTick’,[]); 
set(gca,'YTick’,[]); 

xlabel({^x_* with (x_0_0,x_*) intermittently sampled \ ^along vertical domain bisector’}); 
h = legend(’H_l = log_2 W_l’,’naive quadratic fit’,4); 
pvert = p; 

Hmid = mean(H([L/4,L/4+l],:),1); 
ind = findC^isnanCHmid)); 
p = polyfit(ind,Hmid(ind),2); 

figure;plot(ind(l:8:end),Hmid(ind(l:8:end)),’ko’,l:L,p(l)*(l:L).“2+p(2)*(l:L)+p(3),’k:’); 
aLxis ( [0 L L/2 L] ) ; 
set(gca,’XTick’,[]); 
set(gca,’YTick’,[] ); 

xlabel({’x_* with (x_0_0,x_*) intermittently sampled’,’along horizontal domain bisector’}); 
h = legend(’H_l = log_2 W_l’,’naive quadratic fit’,4); 
phorz = p; 

[x,y] = meshgridd :L, 1: L/2) ; 
x2 = x+(x.*y)./(L/2-y); y2 = y; 

ind = intersectCfindC'isncuiCH)) ,find(~isinf (x2))); 
w = griddata(x2(ind),y2(ind),H(ind),x,y); 

figure; [C,h] = contour(w,16:16:L,’k’); 
clabeKC, ’mcuiual’); 
hold;plot(l:L,L/4, ’k—’) ; 

g = phorz(l)*x. '‘2+phorz(2)*x+phorz(3)+pvert(l)*y. '‘2+pvert (2)*y+pvert (3); 

g = g/2; % horz and vert double count 

[C2,h2] = contour(g,[224,240],’r’); 

clabel(C2,’manual’,’Color’,’r’); 

set(gca,’XTick’,0:32:256) 

set(gca,’YTick’,0:32:128) 

xlabel(’(l+x_*/(L/2“X_*))x_0_0’); 

ylabel(’x_*’); 

titleC’black: H_1 = log_2 W_l; red: contours of naive quadratic fit’); 


2. §IV B 3| 

Figure]^ in §IV B 3| was generated using the following MATLAB commands: 


September 11, 2015 


Approved for Public Release; Distribution Unlimited 








Non-Technical Data - Releasable to Foreign Persons 


16 


A1 = 4*[2,5;5,4] ; 

L = suin(suiii(Al)) ; % 256 
ce = ncLn(L/2,L); 
ce2 = nan(L/2,L); 
for j = l:L/2 

for k = 1:(L-1) 

[j k] % for tracking progress 

A2 = [k-1,j;j,L-(k“l)“2*j]; % same length 

Aa = max(Al-A2,0); 

Ab = max(A2-Al,0); 

A = Aa'+Ab; 

if any(any(A2<0)) 

else 

ce(j,k) = cequiverentropy(A); 

end 

ce2(j,k) = cequiverentropy(A2); 

end 

end 

LL = nan(L/2,L); 

LM = nan(L/2,L); 
for j = l:L/2 

for k = 1:(L-1) 

A2 = [k-1,j;j,L-(k-l)-2*j]; % same length 
Aa = max(Al-A2,0); 

Ab = max(A2-Al,0); 

A = Aa^+Ab; 

if any(any(A2<0)) 

else 

LL(j,k) = sum(sum(A)); 

LM(j,k) = 1; 

end 

end 

end 

delta = L/8; 
figure; 

contourf(ce/log(2),[L,L],^k’,’LineColor',’none’); 

colormap(gray); 

hold; 

[C,h] = contour(ce/log(2),delta:delta:2*L,’k’); 
contour(LL,[L,L],’r’) 
caxis([0 L*1.25]); 

set(gca,’XTick’,l:delta:(L+1),’XTickLabel’,0:delta:L) 
set(gca,’YTick’,0:delta/2:L/2) 
xlabel(’X’’_0_0’); 
ylabeK’x’ 

title(’H_l(w’’I Iw) with x_0_0 = 32 and x_* = 80’); 
clabeKC, ’meuiual’) 


3. j fVAl 


The following commands were applied to two input text files: genbankfile and fastafile, which respectively 
contained 109 GenBank and FASTA-formatted mtDNA sequences of a broadly representative set of primates, and 
obtained as described in the code comments in §C 5[ genbankfile was used purely for its annotations; the remainder 
of the analysis used fastafile. The computation required less than 2 hours in a single MATLAB session on a 
standard laptop and automatically generated figures and Figure was generated along similar lines using data 
produced by the second author from a Levenshtein distance routine in Python that is not included here. 


/o/o Preliminciries 

fp = fastapcLTse(fastafile); 

N = length(fp.seq); 

kmax = 7; 

alpha = ’ACGTN’; 

/o/o Radix word quivers 
for k = l:kmax 
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for j = 1:N 

rwq-Ck}-Cj} = radixwordquiver(fp.seq{j},k,alpha); 

[j,k] 7, for tracking progress 

end 

end 

disp(^radix word quivers done’); 

VL Entropies of individual words 
for k = likmax 

Hl{k} = zeros(l,N); 
for j = 1:N 

Hl{k}(j) = cequiverentropy(rwq{k}{j}); 

end 

k 7« for tracking progress 

end 

disp(’individual entropies done’); 

V/„ Relative entropies 
for k = likmax 

H{k} = zeros(N); 
for ja = 1:N 

for jb = ja:N 

M = boxininus(rwq{k}{ja},rwq-Ck}-Cjb}) ; 

H{k}(ja,jb) = cequiverentropy(M); 

end 

[ja,k] Vo for tracking progress 

end 

H{k} = H{k}+H{k}’; 

end 

disp(’relative entropies done’); 

V/„ Construct radix quivers for concatenated words cuid compute their entropy 
7# This is significantly more efficient than forming the radix quivers 
7o directly 
n = length(alpha); 
for k = l:kmax 
for i = 1:N 

u = fp.seq-Ci}; 

Au = rwq{k}-Ci}; 
for j = i:N 

V = fp.seq-Cj}; 

Av = rwq{k}{j}; 

crwq = concatenateradixwordquiver(u,v,Au,Av,alpha); 

Hcrwq{k}(i,j) = cequiverentropy(crwq); 

end 

[i,k] 7« for tracking progress 

end 

7o Symmetrize and avoid double-counting the diagonal 
Hcrwq{k} = Hcrwq-Ck}+Hcrwq{k} ’-diag(diag(Hcrwq{k}) ) ; 

end 

disp(’concatenated radix quivers and entropies done’); 

7o7o Linkages 
for k = l:kmax 

sf-Ck} = squareform(H{k}); 

sfnorm{k)- = squareform(H-Ck}./Hcrwq{k)-); 

tree-Ck} = linkage (sf{k}); 

treeav{k} = linkage(sf-tk},’average’); 

treecp{k} = linkage(sf-tk},’complete’); 

treenorm{k} = linkage (sfnorm-Ck}); 

treenormav-Ck} = linkage (sf norm-Ck}, ’ average ’); 

treenormcp-Ck} = linkage (sf norm-Ck}complete’); 

k 7« for tracking progress 

end 

disp(’linkages done’); 

77o Taxonomic info 
7o genus/species 
for j = 1:N 

11 = max(strfind(fp.desc-Cj}, ’ I’))+2; 

12 = min(strfind(fp.desc-Cj},’mitochondrion, complete genome’))-!; 
desc-Cj} = fp.desc{j}(il: i2); 

[genus{j},remain] = strtok(desc{j},’ ’); 
species-Cj} = strtok(remain, ’ ’); 
genusspecies{j} = [genus-Cj},’ ’,species-Cj}] ; 
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end 

7o Other info 

taxa = genbcuikorganismCgenbankf ile); 

% Exclude taxonomic info above order (for Primates) 
for j = 1:N 

taxa-Cj} = taxa-Cj}(10:end); 

end 

V/„ Produce phylogenetic trees 
phylotree(treeav{kmax},genusspecies,taxa); 
phylotree(treenormav{kmax},genusspecies jtaxa); 
phylotree(treenormcp{kmax},genusspecies,taxa); 


Appendix C: MATLAB code 


1. boxminus.m 


function y = boxminus(Al,A2); 
y = max(Al-A2,0)+max(A2“Al,0)’; 


2. cequiverentropy.m 


function y = cequiverentropy(A); 

7 Computes the entropy (in nats) of a componentwise Eulerian quiver with 
7 adjacency matrix A 

"/X Basic error checking 
if size(A,l)-size(A,2) 

error(^A is not square’); 

end 

degl = sum(A,1); 
deg2 = sum(A,2)’; 
if any(degl“deg2) 

error([’sum(A,1) not equal to sum(A,2),’... 

’so A is not componentwise Eulerian’]); 

end 

if ~any(cLny (A) ) 

y = 0; return; 

end 

"/tL Shrink A by retaining only nontrivial vertices 
ind = f ind(degl) ; 7, = find(deg2); 

A = A(ind,ind); 

7«7o Find (strongly) connected components of A 
see = strongconcoms(A); 
nscc = max(scc); 

7«7o Compute H(A(find(scc==j) ,find(scc==j))) 

H = 0; 

for j = l:nscc 

scej = find(scc==j); 

Aj = A(sccj jsccj); 

Hj = equiverentropy(Aj); 

H = H+Hj; 

end 
y = H; 
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3. concatenateradixwordquiver.m 

function y = concatenateradixwordquiver(u,v,Au,Av,alpha); 

7, Given two compatible outputs of radixwordquiver, this RAPIDLY produces 
7 the quiver corresponding to the concatenation of the underlying words. 

7# That is, given words u, v with corresponding radix word quiver matrices 
7# Au, Av, the RWQ matrix for uv is Auv = Au+Av+B, where B corresponds to 
7o the deletion of 2k edges and the insertion of 2k (typically different) 

7o edges. Note that this is symmetric because the cyclic word uv equals the 
7 cyclic word vu. 

7 alpha is the alphabet used to construct Au and Av. 

7 Example: 

7 il = 56; i2 = 78; 

7 temp = concatenateradixwordquiver(fp.seq{il},fp.seq-Ci2}, . . . 

7 rwq{3}{il}- ,rwq{3}{i2}-, ’ ACGTN’ ) ; 

7 temp2 = radixwordquiver ( [fp. seq{il}, fp. seq{i2}] ,3, ^ACGTNO ; 

7 temp-temp2 7 Should be zero (as when tested at time of writing) 

77 Rudimentary error checking 
s = size(Au); 
n = length(alpha); 
k = fix(log(s(l))/log(n)); 
if any(s-size(Av)) 

error(^incompatible sizes’); 
elseif s(l)-s(2) 

error(’not square’); 
elseif abs(k“log(s(1))/log(n)) > 10““6 
error(’k not an integer’); 

end 

77 Other preliminaries 
nary = n.''( (k-1) :-1:0) ’ ; 

Lu = length(u); 

Lv = length(v); 

7 Map letters to 0:(n-l) 

U = zeros(size(u)); 

V = zeros(size(v)); 
for j = l:n 

U(u==alpha(j))=j-1; 

V(v==alpha(j))=j-1; 

end 

UU = [U,U]; 

VV = [V,V]; 

UV = [U,V]; 

VU = [V,U]; 

7 Initialize sparse matrix 
B = sparse(n''k,n'‘k); 

77 Deletions 
for j = l:k 

delOl = UU((Lu-k+j):(Lu-l+j))*nary; 
delll = UU((Lu-k+j+l):(Lu+j))*nary; 

B(del01+l,delll+l) = B(del01+1,delll+l)-l; 
del02 = VV((Lv-k+j):(Lv-l+j))*nary; 
dell2 = VV((Lv-k+j+l):(Lv+j))*nary; 

B(del02+l,dell2+l) = B(del02+1,dell2+l)-l; 

end 


77 Insertions 
for j = l:k 

insOl = UV((Lu“k+j):(Lu-l+j))*nary; 
insll = UV((Lu“k+j+l):(Lu+j))*nary; 
B(ins01+1,insll+1) = B(ins01+1,ins11+1)+1; 
ins02 = VU((Lv-k+j):(Lv-l+j))*nary; 
insl2 = VU((Lv-k+j+1):(Lv+j))*nary; 
B(ins02+1,insl2+l) = B(ins02+1,ins12+1)+l; 

end 

77 Output 
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y = Au+Av+B; 


4. equiverentropy.m 


function y = equiverentropy(A); 

7, Computes the entropy (in nats) of cui Eulerian quiver with adjacency 
7« matrix A. 

7« Basic care has been exercised to ensure that most of the computation time 
7 is spent on the fundcimental determinant evaluation instead of, e.g., 

7« computing a GCD or the sum of log-factorial terms. There is probably 
7« still room for improvement though. 

"/X Force A to be sparse to enable the sparse LU factorization below 
A = sparse(A); 

"/tL Basic error checking 
if size(A,l)-size(A,2) 

error(^A not square’); 

end 

degl = sum(A,1); 
deg2 = sum(A,2)’; 
if any(degl-deg2) 

error(’A not Eulerian’); 

end 

if any(any(A<0)) 

warning(’negative entry’); y = NaN; return; 

end 

if "anyCcuiy(A)) 

y = 0; return; 

end 

VL Shrink A by retaining only nontrivial vertices 
ind = f ind(degl) ; 7o = find(deg2); 

A = A(ind,ind); 

VL Build table for quicksumlogfactorial function below 
7« The largest degree governs the required table size 
logfac = cumsum(log(l:max(degl))); 

"/tL Find g = gcd(A) non-recursively (for speed) euid construct its divisors 
7« This approach is much faster in MATLAB than a recursive GCD, let alone 
7« one that doesn’t restrict consideration to unique nonzero entries of A. 

7« In, e.g., C this would not be the correct approach 
[J,K] = find(A); 
g = 1; 
if numel(J) 

Vo Form vertical array of relevant entries 
AJK = A(sub2ind(size(A),J,K)); 

Vo Only consider the unique values in AJK (note that these are sorted 
Vo from smallest to largest). NB. This approach appears to be faster 
Vo thcui, e.g., omitting [J,K] = find(A), changing "if numel(J)" to 
Vo "if any(A)" and setting AJK = setdiff (unique(A) ,0); 

AJK = unique(AJK); 

Vo The gcd must be a divisor of AJK(l)—so first form all the divisors 
divAJKl = divisors(AJK(l)); 

Vo Now just look for any remainders under trial division of the entries 
Vo of A by each divisor 
for k = 1:numel(divAJKl) 

if ~mod(AJK,divAJKl(k)) 
g = divAJKl(k); 

end 

end 

end 

div = divisors(g); 
nd = numel(div); 
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7o'/o Totient of each divisor 
phi = ones(l,nd); 
for k = l:nd 

d = div(k); 
fd = factor(d); 
ufd = unique(fd); 
nufd = length(ufd); 
phi(k) = 1; 
for m = 1:nufd 

nm = sumCf d==uf d(in) ) ; 

phi(k) = phi(k)*max((ufd(in) ~ (nra-1) )*(ufd(m)-1) , 1) ; 

end 

end 

7,7 Compute the entropy 

H = 0; 7o The initialization to 0 vs -Inf is ameliorated at the very end 
for k = l:nd 

d = div(k); 

Ad = A/d; 

degd = sum(Ad,2)’; 7« = sum(Ad,l) 

L = diag(degd)-Ad; 

L2 = L(2:end,2:end); 

7o7o Compute log(det(L2)) with a modicum of care 

7o A naive yet nontrivial approach would be to compute 

7o logt = real(trace(logm(full(L2)))) ; 

7o But this is bad: logm requires a full matrix and it’s a 
7o computationally nasty way to do things—though not quite as bad as 
7o logt = logCdet (L2) ) ; 

7o We can do better with a sparse LU decomposition 
7o [LL,UL,PL,QL,RL] = lu(L2) ; 

7o logt = sum(log(abs(diag(UL))))+sum(log(abs(diag(RL)))) ; 

7o though on at least one occasion this gave the error message 
7o "Sparse LU failed." This error was not found on the internet cuid no 
7o real attempt was made to understand it—instead we use the 
7o perfunctory remedy of a try-catch block to deal with it below using 
7o the alternative of the Q-less sparse QR decomposition, e.g. 

7o RL = qr(L2); 

7o logt = sum(log(abs(diag(RL)))); 

7o which folklore suggests is more stable—though testing indicates that 
7o this is also much slower. 

7o 

7o We could go much deeper down the rabbit hole, exploiting the sparsity 
7o and integrality of L with great sophistication by, e.g. using 
7o Chinese remaindering/lifting and/or Smith normal forms) . A partial 
7o bibliography of relevant references follows: 

7o \bibitem{PanYuStewart}Pan, V. Y., Yu, Y. , and Stewart, C. ‘‘Algebraic 
7o and numerical techniques for the computation of matrix 
7o determinants.’’ \emph{Comp. Math. Appl.} {\bf 34}, 43 (1997). 

7o \bibitem{KaltofenVillard}Kaltofen, E. and Villard, G. ‘‘Computing the 
7o sign or the value of the determincuit of an integer matrix, a 
7o complexity survey.’’ Xemph-fJ. Comp. Appl. Math.} {\bf 162}, 133 
7o (2004). 

7o \bibitem{0gita}0gita, T. ‘‘Exact determincuit of integer matrices.’’ 

7o \emph-CProc. 4th. Int. Workshop on Reliable Engineering Computing.} 

7o (2010). 

7o \bibitem{ElsheikhEtAl}Elsheikh, M. \emph-Cet al.} ‘‘Fast computation 
7o of Smith forms of sparse matrices over local rings.’’ \emph-tISAAC} 

7o (2012). 

7o \bibitem{EberlyEtAl}Eberly, W. \emph{et al.} ‘‘Faster inversion and 
7o other black box matrix computations using efficient block 
7o projections.’’ \emph{ISAAC} (2007). 

7o \bibitem{DumasSaundersVillard}Dumas, J.-G., Saunders, B. D., cuid 
7o Villard, G. ‘‘On efficient sparse integer matrix Smith normal form 
7o computations.’’ \emph{J. Symb. Comp.} {\bf 32}, 71 (2001). 
try 

7o NB. It may be advantageous to permute L2, but as yet we don’t 
[LL,UL,PL,QL,RL] = lu(L2); 

logt = sum(log(abs(diag(UL))))+sum(log(abs(diag(RL)))); 
catch err 
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if (strcmp(err.identifier,’MATLAB:sparseLUfactor:internalError’)) 

% Try the supposedly more stable but slower Q-less QR 
RL = qr(L2); 

logt = sum(log(abs(diag(RL)))); 

else 

rethrow(err); 

end 

end 

’/o7. 

logo = logt+quicksumlogfactorial(degd-1,logfac); 
logfAd = quicksumlogfactorial(Ad,logfac); 
logterm = log(phi(k))+logc-log(d)“logfAd; 

H = logplus(H,logterm); 

end 

%% Output 

% "H = log(exp(H)-1)" here comes from initialization to 0 vs -Inf (see the 
% start of the "Compute the entropy" code cell/section) 
y = H+log(l-exp(-H)); 

end % END MAIN FUNCTION 

%% LOCAL FUNCTIONS BELOW 


%% 

function y = divisors(x); 

% Returns the divisors of x 

fX = factor(x); 
nf = numel(fx); 
mask = zeros(2"nf,nf); 
temp = dec2bin(0: ((2'‘nf)-l) ,nf) ; 
divO = zeros (2''nf, 1); 
for j = l:2“nf 
for k = 1:nf 

mask(j,k) = str2num(temp(j,k)); 

end 

divO(j) = prod(max(mask(j*fx,1)); 

end 

y = unique(divO)^; 
end 


%% 

function y = logpIus(loga,logb); 

% Given log(a) = loga and log(b) = logb, returns y = log(a+b). Naively, 
% y = loga+log(l+exp(logb-loga)); 

% However, we need to take a little care to avoid under/overflow. 

X = max(loga,logb); 

X = mindoga,logb); 
y = X+log(l+exp(x“X)); 

end 


%% 

function y = quicksumlogfactorial(M,logfac); 

% Given a nonnegative integer array M, compute the sum of the logarithms of 
% factorials in M in a careful way using the precomputed table 
% logfac = cumsumdogd:mM)), where mM = max(. . .max(M). . .) 

X = 0; 

M = M(find(M>l)); % find(M>l) is (MUCH) better to use than find(M) 
for j = l:numel(M) 

X = x+logfac(M(j)); 

end 
y = x; 
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end 


5. fastaparse.m 


function y = fastaparse(fastafile); 

7# Parses PASTA format text file. The (ccuionical) example used here is a 
7# file containing 616 complete mammalian mitochondrial genomes accessed 
7. 20140124 from the URL 

7« \protect\vrule widthOpt\protect\href{http://www.ncbi.nlm.nih.gov/nuccore/?term=mitochondrion[Definition]+AND}{http://www.ncbi.nlm 
7« complete+genome"[Definition]+AND+srcdb_refseq[Properties]+AND+Mammalia[Or 
7o ganism] 

7. 

7« A smaller example is a file containing 109 complete primate genomes 
7o accessed 20140205 from the URL 

7o \protect\vrule widthOpt\protect\href{http://www.ncbi.nlm.nih.gov/nuccore/?term=mitochondrion[Definition]+AND}{http://www.ncbi.nlm 
7o complete+genome"[Definition]+AND+srcdb_refseq[Properties]+AND+Primates[Or 
7« ganism] 

7o Example usage for mtDNA PASTA files: 

7# fp = fastaparse(f astaf ile); 

7# for j = 1:length(fp.desc) 

7o il = maxCstrfind(fp.desc{j}, ’ I 0)+2; 

7o i2 = min(strfind(fp.desc{j)-,’mitochondrion, complete genome’))-!; 

7« desc{j} = fp.desc{j}-(il:i2); 

7o [genus{j},remain{j}] = strtok(desc{j}, ’ ’); 

7o [species{j},remain{j)-] = strtok(remain{j)-, ’ ’); 

7. seq{j} = fp.seq{j}; 

7# end 

fasta = filereadCfastafile); 
lb = linebreak(fasta); 

L = length(lb); 

7o Get line numbers for description lines 
dline = find(cellfun((§numel,strfind(lb,’>’))); 

N = length(dline); 

7« Phantom line for convenience 
dline = [dline,L]; 

7« Assemble components into sequences and their descriptions 
for j = 1:N 

n = dline(j); 

s.desc{j} = strtrim(lb{n}); 
n = n+1; 
s.seq{j} = ” ; 
while n < dline(j+l) 

s.seq{j} = [s.seq{j},strtrim(lb{n})]; 
n = n+1; 

end 

7o7o POR CONVENIENCE replace any non-ACGT characters with ’N’ 
s.seq{j} = regexprep(s.seq{j)-, ’ [''ACGT] ’,’N’); 

end 
y = s; 


6. genbankorganism.m 


function y = genbankorganism(genbankfile); 

7« Used to extract organism data from a Genbank file 
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gb = filereadCgenbankfile); 
lb = linebreak(gb); 

L = length(lb); 

7# Get line numbers for ORGANISM lines 

orgline = find(cellfun(@numel,strfind(lb,^ ORGANISM ’))); 
7o Get line numbers for REFERENCE lines 

refline = find(cellfun(@numel,strfind(lb,PREFERENCE ’))); 

N = length(orgline); 
for j = 1:N 

7o Get the first REFERENCE line after each ORGANISM line 

nextrefline = refline(find(refline>orgline(j),1)); 

lines = (orgline(j)+l):(nextrefline-1); 

tax{j} = p p; 

for k = 1:numel(lines) 

tax-Cj} = [tax{j},strtrim(lb{lines(k)})] ; 

end 

tax-Cj} = strrep(tax-Cj}, p 
tax-Cj} = strrep(tax-Cj}, p ; 

remain = tax-Cj}; 
k = 1; 

while remain 

[taxa{j}{k},remain] = strtok(remain,p;’); 
k = k-i-l; 

end 

end 

y = taxa; 


7. linebreak.m 


function y = linebreak(x); 

7# Break a string x into nontrivial lines according to return characters. 
7o Newlines are replaced with return characters beforehand. 

7o Recall that \n = char(lO) aind \r = char(13) 

X = strrep(x,char(10),char(13)); 

tworets = strfind(x,char([13 13])); 
while tworets 

x = strrep(x,char([13 13]),char(13)); 
tworets = strfind(x,char( [13 13])); 

end 

rets = find(x==char(13)); 

Ir = length(rets); 

if Ir == 0 

lb{l} = x; 
lb{2} = PP; 

else 

lb{l} = x(l:rets(l)); 
for j = 2:lr 

lb{j} = x((rets(j“l)-t-l) :rets(j)) ; 

end 

lb{lr-H} = x(rets (end)-H : end) ; 

end 

y = lb; 

7# NB. Reassembly is easy, but included explicitly here in a comment for 
7o convenience. Below, if lb = linebreak(x), then x2 = x: 
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7# x2 = for j = 1: length(lb), x2 = [x2,lb{j}] ; end 


8. phylotree.m 


function y = phylotree(tree,labs,taxa); 

7« Automatically produce a rooted phylogenetic tree. 

7« tree should be obtained from linkage of data with labels labs. 

7« taxa = genbankorgcuiismCgenbankf ile); 

YtYo Preliminary plot 
figure; 

[H,T,outperm] = dendrogram (tree,0,’orientation’,’right’,’labels’,labs); 

set(H,’Color’,’k’); 

hold; 

7, Plot geometry 

DAR = get(gca,’DataAspectRatio’); 
xyR = DAR(1)/DAR(2); 
xL = get(gca,’XLim’); 

7«7o Get inverse of outperm 

N = length(outperm); 

temp = sortrows([outperm’,(1:N)’]); 

invperm = temp(:,2)’; 

Y<tYo Rudimentary error checking 
if N-nurael(labs) 

error(’wrong number of labels’); 

end 

7«7o Get children of jth internal node 

7« Nodes are 1:N for leaves and (N+1) : (2*N-1) for internals, from bottom up. 
7« I.e., the shortest branches come from node N+1, etc. 
for j = 1:N-1 

children-[j} = tree (j, 1:2); 
while any(children-[j}>N) 

7o Find next child with children (an "adult") 
ind = find(children{j}>N,l,’first’); 

7o Remove the adult and add its children 
adult = children{j)-(ind); 

children{j} = union(setdiff(children{j)-,adult),tree(adult“N,1:2)); 

end 

end 

7«7o For each internal node, find two key cell arrays of words 
7« A: the intersection of taxa words under the node 
7« B: the union of taxa words NOT under the node 

7« Along the way, find the union alltaxa of all taxa words for the tree 
alltaxa = {}; 

7« This loop builds A and alltaxa 
for j = (N-1):-l:l 

t = taxa{children{j}(l)}; 
alltaxa = union(alltaxa,t); 

A{j} = t; 

for k = 2:length(children'[j}) 
t = taxa{children-[j}(k)}; 
alltaxa = union(alltaxa,t); 

A{j} = intersect(A-Cj},t); 

7B0-[j} = union(B0-[j},t) ; 

end 

end 

7o 7» This loop builds B 
for j = (N-1):-l:l 

otherchildren = setdiff (1 :N,children-Cj}); 

B{j} = O; 

for k = 1:length(otherchildren) 

B{j} = union(B-[j},taxa{otherchildren(k)}); 
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end 

end 


VL For each internal node, its label is the "highest-ranking" member of A\B 
for j = (N-1):-l:l 

'/o The first child is as good as euiy to use... 
t = taxa-Cchildren{j}(l)}; 

C = setdiff (A{j},B-Cj}) ; 

label{j} = ’^; 

for k = length(t):-1:1 

if cLny(strcmp(t-Ck},C)) 
label{j} = t{k}; 

end 

end 

end 


VL Add labels 

dxl = .005*(xL(2)“xL(l)); 

for j = 1:N-1 

if length(label-Cj}) 

xl = tree(j,3)+2*dxl; 

x2 = mean(invperm(children-Cj})); 

text(xl,x2,label{j},’FontUnits^,’normalized’,’FontSize’,1/N); 
x2max = max(invperm(children-Cj})); 
x2min = min(invperm(children-Cj})) ; 
xOl = xl-dxl; 

plot( [xOl,x01],[x2min,x2max],’r’); 

end 

end 

VL Axes ticks/etc. 

set(gca,’FontUnits’,’normalized’,’FontSize’,1/N); 
set(gca,’TickLength’,[0 0]); 

y = label; 


9. radixwordquiver.m 


function y = radixwordquiver(w,k,alpha); 

7« Constructs a de Bruijn quiver of order k for a word w over the alphabet 
7 alpha, whose order is presumed to induce the lex order, which is used for 
7 the adjacency matrix A that is actually returned. 

7 Ex. y = radixwordquiver(’GGATTAATGACTAATCAGC’,l,’ACGT’); 

7 NB. Instead of ’ACGT’ for the last argument, it might be necessary to 
7 use, e.g. ’ACGTN’, cf. fastaparse. 


77 

if numel(setdiff(unique(w),alpha)) 
setdiff(unique(w),alpha) 
error(’wrong alphabet’); 

end 

7 Alphabet radix 
n = length(alpha); 
if n^k > 2“64 

error(’n'^k > 2''64; try wordquiver instead’); 

end 


77 Preliminaries 
L = length(w); 

7 Cheap proxy for cyclic word 
w = [w,w(l:k)]; 

7 Map letters to 0:(n-l) 

W = zeros(size(w)); 
for j = l:n 

W(w==alpha(j))=j-1; 

end 
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% Initialize sparse matrix 
A = sparse(n''k,n'‘k); 

V/„ Initialization 
7« Initial k-gram 
JO = 1; 

J1 = k; 

kg = W(J0:Jl); 

7o indl is the index (starting from zero) for the initial k-gram 
nary = n.( (k-1) :-1:0) ' ; 
indl = kg*nary; 

7o7o Main loop 
for j = 1:L 

indO = indl; 

7o Get upcoming k-gram 
JO = j+1; 

Jl = j+k; 
kg = W(J0:Jl); 
indl = kg*nary; 

7o Update the adjacency matrix 
A(ind0+1,indl+1) = A(ind0+1,indl+l)+l; 

end 

77o Output 
y = A; 


10. strongconcoms.m 


function y = strongconcoms(A); 

7o Strongly connected components of a quiver with adjacency matrix A using 
7o the Dulmage-Mendelsohn decomposition 

[n,n2] = size(A); 

7o7o Basic error check 
if n-n2 

error(^A is not square’); 

end 

7«7o Set zero entries of diagonal to unity so that dmperm works correctly 
A = A-spdiags(diag(A),0,n,n)+speye(n); 

7«7o Dulmage-Mendelsohn decomposition 
[p,q,r,s,cc,rr] = dmperm(A); 
if any(p-q) I any(r-s) 

error(’p-q or r-s have a nonzero entry’); 

end 

7o7o Assign strongly connected components 

see = zeros(1,n); 

for j = 1:(length(r)-1) 

scc(p(r(j):(r(j+l)-l))) = j; 

end 

7o7o Output 
y = see; 


11. wordquiver.m 


function y = wordquiver(w,k); 

7« Constructs a de Bruijn quiver of order k for a word w. Returns a struct 
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7# with fields for the k-grams and the adjacency matrix with corresponding 
7o rows/columns. 

VL Preliminaries 
L = length(w); 

7o Cheap proxy for cyclic word 
w = [w,w(l:k)]; 

7# Initialize cell array for k-grams 
kgs = {}; 

7o Initialize sparse matrix 
A = sparse(1,1); 

7o7o Initialization 
7o Initial k-gram 
JO = 1; 

J1 = k; 

kgl = w(JO:Jl); 

7« Initialize cell array for k-grams 
kgs{l} = kgl; 

7o indl == 1 is the index for the initial k-gram 
indl = find(strcmp(kgl,kgs)); 

7«7o Main loop 
for j = 1:L 
kgO = kgl; 
indO = indl; 

7o Get upcoming k-gram 
JO = j+1; 

Jl = j+k; 
kgl = w(J0:Jl); 

7o See if we’ve encountered the upcoming k-gram before 
indl = find(strcmp(kgl,kgs)); 
nindl = numel(indl); 

if nindl == 1 7« We’ve seen the upcoming k-gram before... 

7o ... so do nothing 

elseif nindl == 0 7» We haven’t seen it yet, so include it... 

indl = length(kgs)+l; 
kgs-Cindl} = kgl; 

7o ...and appropriately expand the adjacency matrix 
A(indl,indl) = 0; 
else 7o Something is amiss 

error(’more than one index’); 

end 

7o Update the adjacency matrix 
A(indO,indl) = A(ind0,indl)+l; 

end 

77o Output 
y.kg = kgs; 
y.A = A; 
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